{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.3 神经网络优化算法\n",
    "**梯度下降算法**主要用于优化单个参数的取值，而**反向传播算法**给出了一个高效的方式在所有参数上使用梯度下降算法，从而使神经网络模型在训练数据上的损失函数尽可能小。通过参数的梯度和学习率，参数的更新公式为：\n",
    "\n",
    "$$ \\theta_{n+1} = \\theta_{n} - \\eta\\frac{\\partial}{\\partial\\theta_{n}}J(\\theta_{n})$$\n",
    "\n",
    "其中$\\theta$为参数，下表表示迭代轮数，$\\eta$表示学习率（learning rate），$J(\\theta)$表示损失函数。下面假设我们要最小化函数  $y=x^2$, 选择初始点   $x_0=5$，学习率为0.3，这个优化过程可以总结如下，可以看到，只经过5轮迭代，参数$x$的值已经比较接近最优值0了：\n",
    "\n",
    "| 轮数 | 当前轮参数值 |     梯度\\*学习率   |   更新后的参数值   |\n",
    "| ---- |   -----    |      -------      |    -------       |\n",
    "|   1 |      5    |     2\\*5\\*0.3=3    |       5-3=2     |\n",
    "|   2 |      2    |    2\\*2\\*0.3=1.2    |     2-1.2=0.8    |\n",
    "|   3 |     0.8   |  2\\*0.8\\*0.3=0.48    |    0.8-0.48=0.32   |\n",
    "|   4 |    0.32    |   2\\*0.32\\*0.3=0.192 |   0.32-0.192=0.128    |\n",
    "|   5 |   0.128    |  2\\*0.128\\*0.3=0.0768 | 0.128-0.0768=0.0512     |\n",
    "\n",
    "**神经网络的优化可以分为两个阶段：**\n",
    "- 第一阶段先前向传播计算得到预测值，并将预测值和真实值做对比得到两者之间的差距；\n",
    "- 第二阶段通过反向传播计算损失函数对每一个参数的梯度，再根据梯度和学习率使用梯度下降算法更新每一个参数。\n",
    "\n",
    "梯度下降算法注意点：\n",
    "- 梯度下降算法并**不能保证被优化的函数达到全局最优解**，有可能只是局部最优，也因此参数的初始值很大程度上影响最后得到的结果。**只有当损失函数为凸函数时，梯度下降算法才能保证达到全局最优解。**\n",
    "\n",
    "- 梯度下降算法的另一个问题就是**计算时间太长**，因为在每轮迭代中$J(\\theta)$是在所有训练数据上的损失和。为加速，可以使用随机梯度下降（stochastic gradient descent），每轮迭代中随机优化一条训练数据的损失函数，但是每条数据代表整体数据的能力太差。实际使用时，采用折中——**每次计算一小部分训练数据的损失函数，称为一个batch，通过矩阵运算并不比单个数据慢太多，另一方面可以大大减少收敛所需要的迭代次数，同时可以收敛到的结果更加接近梯度下降的效果。**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.4 神经网络进一步优化\n",
    "### 4.4.1 学习率的设置\n",
    "学习率控制参数的更新幅度。\n",
    "- 如果幅度过大，那么可能导致参数在极优值的两侧来回移动；\n",
    "- 如果幅度过小，虽然能保证收敛性，但是会大大降低优化速度，需要更多轮的迭代才能达到一个比较理想的优化效果。\n",
    "\n",
    "继续刚才的假设：需要最小化函数 $y=x^2$, 选择初始点 $x_0=5$。两种情况分别如下："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**a. 学习率过大（学习率为1，x在5和-5之间震荡）**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "After 1 iteration(s): x1 is -5.000000.\n",
      "After 2 iteration(s): x2 is 5.000000.\n",
      "After 3 iteration(s): x3 is -5.000000.\n",
      "After 4 iteration(s): x4 is 5.000000.\n",
      "After 5 iteration(s): x5 is -5.000000.\n",
      "After 6 iteration(s): x6 is 5.000000.\n",
      "After 7 iteration(s): x7 is -5.000000.\n",
      "After 8 iteration(s): x8 is 5.000000.\n",
      "After 9 iteration(s): x9 is -5.000000.\n",
      "After 10 iteration(s): x10 is 5.000000.\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "TRAINING_STEPS = 10\n",
    "LEARNING_RATE = 1\n",
    "\n",
    "x = tf.Variable(tf.constant(5, dtype=tf.float32), name=\"x\")\n",
    "y = tf.square(x)\n",
    "\n",
    "train_op = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(y)\n",
    "\n",
    "with tf.Session() as sess:\n",
    "    sess.run(tf.global_variables_initializer())\n",
    "    for i in range(TRAINING_STEPS):\n",
    "        sess.run(train_op)\n",
    "        x_value = sess.run(x)\n",
    "        print(\"After %s iteration(s): x%s is %f.\"% (i+1, i+1, x_value) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**b. 学习率过小（学习率为0.001，下降速度过慢，在901轮时才收敛到0.823355）**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "After 1 iteration(s): x1 is 4.990000.\n",
      "After 101 iteration(s): x101 is 4.084646.\n",
      "After 201 iteration(s): x201 is 3.343555.\n",
      "After 301 iteration(s): x301 is 2.736923.\n",
      "After 401 iteration(s): x401 is 2.240355.\n",
      "After 501 iteration(s): x501 is 1.833880.\n",
      "After 601 iteration(s): x601 is 1.501153.\n",
      "After 701 iteration(s): x701 is 1.228794.\n",
      "After 801 iteration(s): x801 is 1.005850.\n",
      "After 901 iteration(s): x901 is 0.823355.\n"
     ]
    }
   ],
   "source": [
    "TRAINING_STEPS = 1000\n",
    "LEARNING_RATE = 0.001\n",
    "\n",
    "x = tf.Variable(tf.constant(5, dtype=tf.float32), name=\"x\")\n",
    "y = tf.square(x)\n",
    "\n",
    "train_op = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(y)\n",
    "\n",
    "with tf.Session() as sess:\n",
    "    sess.run(tf.global_variables_initializer())\n",
    "    for i in range(TRAINING_STEPS):\n",
    "        sess.run(train_op)\n",
    "        if i % 100 == 0: \n",
    "            x_value = sess.run(x)\n",
    "            print(\"After %s iteration(s): x%s is %f.\"% (i+1, i+1, x_value))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "TensorFlow提供了一种更加灵活的学习率设置方法——**指数衰减**，tf.train.exponential_decay。通过这个函数，可以**先使用较大的学习速率来得到一个比较优的解，然后随着迭代的继续逐步减小学习率，使模型在训练后期更加稳定。**\n",
    "<p align='center'>\n",
    "    <img src=images/图4.13.JPG>\n",
    "</p>\n",
    "**c. 使用指数衰减的效果如下：**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "After 1 iteration(s): x1 is 4.000000, learning rate is 0.096000.\n",
      "After 11 iteration(s): x11 is 0.690561, learning rate is 0.063824.\n",
      "After 21 iteration(s): x21 is 0.222583, learning rate is 0.042432.\n",
      "After 31 iteration(s): x31 is 0.106405, learning rate is 0.028210.\n",
      "After 41 iteration(s): x41 is 0.065548, learning rate is 0.018755.\n",
      "After 51 iteration(s): x51 is 0.047625, learning rate is 0.012469.\n",
      "After 61 iteration(s): x61 is 0.038558, learning rate is 0.008290.\n",
      "After 71 iteration(s): x71 is 0.033523, learning rate is 0.005511.\n",
      "After 81 iteration(s): x81 is 0.030553, learning rate is 0.003664.\n",
      "After 91 iteration(s): x91 is 0.028727, learning rate is 0.002436.\n"
     ]
    }
   ],
   "source": [
    "TRAINING_STEPS = 100\n",
    "global_step = tf.Variable(0)\n",
    "LEARNING_RATE = tf.train.exponential_decay(0.1, global_step, 1, 0.96, staircase=True)\n",
    "\n",
    "x = tf.Variable(tf.constant(5, dtype=tf.float32), name=\"x\")\n",
    "y = tf.square(x)\n",
    "# 在minimize中传入global_step将自动更新global_step的参数。从而使得学习率得到相应更新\n",
    "train_op = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(y, global_step=global_step)\n",
    "\n",
    "with tf.Session() as sess:\n",
    "    sess.run(tf.global_variables_initializer())\n",
    "    for i in range(TRAINING_STEPS):\n",
    "        sess.run(train_op)\n",
    "        if i % 10 == 0:\n",
    "            LEARNING_RATE_value = sess.run(LEARNING_RATE)\n",
    "            x_value = sess.run(x)\n",
    "            print(\"After %s iteration(s): x%s is %f, learning rate is %f.\"% (i+1, i+1, x_value, LEARNING_RATE_value))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
